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ABSTRACT 

Performance measures used in contemporary analyses suffer 
from a number of shortcomings when real-time multiprocessors 
are considered. Most of these measures do not take account of the 
needs of the application, except perhaps in a subjective manner. 


In this paper, we consider some new performance measures to 
characterize fault-tolerant multiprocessors used in the control of 
critical processes. Our performance indices are based on con- 
troller response time. By relating this to the needs of the applica- 
tion. we have been able to derive indices that faithfully reflect the 
performance of the multiprocessor in the context of the applica- 
tion. that permit the objective comparison of rival computer sys- 
tems. and that can either be definitively estimated or objectively 
measured. 


An example of a controller in an idealized satellite application 
is provided. 
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1. Introduction 

1.1. Motivation 

Over the last few years, Inexpensive, powerful, and reliable microprocessors 
have become available. At the same time, analytical simulation and modelling 
techniques for use in computer communication networks have been developed. 
Multiprocessors are therefore becoming attractive. One special application to 
which they can contribute a great deal is reliable control. 

Several reliable multiprocessors have been proposed and a smaller number 
built. Among the latter, one might mention the Fault Tolerant Multiprocessor 
(FTMP) [l] and the Software Implemented Fault Tolerance (SIFT) machine [2], 
both built under contract to NASA for the control of civilian aircraft of the next 
decade. Reliability requirements are stringent: the benchmark figure employed 
by NASA is that the probability of controller failure should not exceed 10 -# for a 
ten hour flight. Naturally, such a benchmark begs the question of how "failure" 
or the "performance" of a complex multiprocessor system is to be defined. In 
this paper, we present some measures that are appropriate in this context and 
by extension, in the context of other control applications - nuclear reactors, 
life-support systems, etc. -- where controller failure can lead to catastrophic 
consequences. 

1.2. The Nature of the Control Function 

A computer controller consists of three communicating parts. These are 
the data acquisition, data processing and output sections. The data acquisition 
section consists of sensors, input panels and other associated equipment; the 
processing section consists of the computer (in our case the multiprocessor) 
and the output section consists of mechanical actuators, displays and other out- 
put devices. The system may logically be regarded as a three-stage pipe*. 

1. Indeed, the M*FCS system [18] of the United States Air Force is physically configured in 
this manner. 
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The aet of tasks to be exeouted by a control system is predetermined and 
the nature and behavior of the software kirnwn In advance - at least In outline — 
to the designer. The multiprocessor controller Is therefore a rather specialized 
device; It might Indeed be considered custom-built for a particular application. 
This fret makes it both easier and more necessary to obtain a reasonably good 
performance analysis of the system. 

The controller software in the processing section consists of a set of tasks, 
each of which corresponds to some job to be performed in response to particu- 
lar sets of environmental stimuli. 

1.3. The Need Fbr New Performance Indices 

The determining characteristic of a computer controller's performance is 
a combination of reliability and high throughput. The throughput requirements 
arise from the need for quick system response to environmental stimuli. Speed 
is of the essence in a real-time controller since failure can occur not only 
through hardware failure in the system, but also on account of the system not 
responding fast enough to events in the environment. This fact imposes a prem- 
ium on controller response time and leads naturally to the work presented in 
this paper. 

As a result of these special performance requirements, performance meas- 
ures used to characterize general-purpose uniprocessor systems are no longer 
appropriate for multiprocessor controllers. Conventional reliability, throughput 
and availability by themselves alone have little meaning in the context of con- 
trol; a suitable combination of these is necessary. New performance measures 
are required; measures that are congruent to the application, permit the 
expression of specifications that reflect without contortion the true system 
characteristics and application requirements, in addition to allowing an objec- 
tive comparison of rival systems for particular applications. 
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We cannot strata too heavily that it is meaningless to spaak of tha perfor- 
mance of a computar out of tha context of its application. Tha form tha perfor- 
mance measures take must reflect tha needs of the application, and tha com- 
putar system must be modelled within this context. The multiprocessor con- 
troller and the controlled process form a synergestic pair, and any effort to 
study the one must take account of the needs of the other. 

It is important that performance measures should depend on parameters 
that can be definitively estimated or objectively measured. Much of the work 
published so far on characterizing the "goodness of f ir ’ between the attributes 
offered by a computer system (reliability, throughput, etc.) and those required 
by the application depends on parameters obtained through essentially subjec- 
tive analyses. (See, for example, [3]). It has been our policy in this paper, how- 
ever, to always base performance indices on experimentally-measureable quan- 
tities, i.e., system response time for the various system tasks. 

1.4. Organization of Paper 

i 

In Section 2, we discuss our new performance measures and in Section 3 
means for practically obtaining these quantities. An example is presented in 
Section 4. and the paper concludes with Section 5. 

2. Performance Measures 
0.1. Survey 

The concept of systems that are, through built-in redundancy, much more 
reliable than any of their components is not new. Some pioneering work was 
done by von Neumann [4] and by Moore and Shannon [5], both in 1956. 

Work in modelling fault-tolerant multiprocessors was done by Bouricius ef. 
al, [6j, who took into account the impact of transient and permanent failures, 
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Borgs non and Frietas, [7], who studied the PRIME computer as a gracefully 
degrading system, and other researchers who, like their precursors, used Mar- 
kov models In the representation of these systems [B]-[ll]. 

With only one exception ([£]), the work referred to above assumed tradi- 
tional performance measures. Other workers recognized the shortcomings of 
these and defined new ones. Among these, one might mention Beaudry [12] and 
Meyer [13]. 

The chief drawback of Beaudry's measures is that they express the capabil- 
ity of the system as a whole, not with respect to particular tasks. The computer 
is treated as a monolith in terms of the services it delivers. The implicit presup- 
position is that all tasks are qualitatively of the same form and have practically 
the same "cost" as a function of the response time 8 . However, the tasks that a 
real-time computer controller executes are typically widely varying both in 
terms of the demands they make on various entities of the multiprocessor and 
in the fact that they are of unequal importance. The performance measures of 
Beaudry thus express the performance of the computer without reference to the 
particular needs of the application. 

Meyer's concept of performability represent «• a useful advance in the 
search for appropriate performance measures. However, the actual perfor- 
mance measures used sometimes require further refinement. For example, in 
[13], Meyer employs as the performance measure, Y„ the fraction of arrived 
tasks that are processed by a system S in an interval of time [0,t]. This, as in the 
case of Beaudry's performance measures allows for no discrimination between 
individual controller tasks. 

To remove, at least partially, the limitations of the work cited above, we 

present here some new performance measures. 

2. Or equivalently, that it doe* not matter if they do not. 
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Definitions and Buie Conoapts 

Because of the stochastic nature of the * transient and/or permanent 
failures cf components In a multiprocessor, the load-dependent blocking at 
shared resources, and conditional branches in the software, a probabilistic 
model Is required to properly represent the behavior of a multiprocessor con- 
troller. A state-space model can be formulated, with the states representing the 
current operational capacity of the system*. Multiprocessor response time Is a 
function of the state. (Response time Is the time Interval between the moment of 
task Initiation and the actuator and/or display result that occurs.) 

A task Is triggered when some set of events in the operating environment 
Initiates its execution. If the environment Is stochastically stationary, there are 
intertrigger distributions. 

Every time a task is triggered, a unique version of it is created. This version 
Is called an extant version until its execution Is complete. Versions are num- 
bered in sequence of triggering: successive versions of task i being denoted by 
2u. Ej,t, etc. The response time associated with a version Eg is denoted by 
RE5P(ij), under the assumption that the system continues operating until the 
version completes executing. (This is clearly only valla so far as the expected 
response time is finite). The extant time of a particular extant version of task l 
is the difference between the absolute (or system) time at present and that at 
the moment of tr.ggering. When an extant version finishes executing, its extant 
time is frozen by definition. Thus, the extant time at t of a version Hg that was 
triggered at time Tg with the system in state ng is defined by 
EXT[Sg,Tg,ng,t]*mlrf t-Tg. RESP(ij)]. 

A controller task may be critical or non-critical. A critical task has a hard 


3. Tie actual definition of tie ■Ui.es depend* on the syr.err. that :» being rr.ode'.’.ed. C hooting 
a suitable formulation to facilitate lyatem analysis la often a challenging task. 
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deadline [ 18 ], th# violation of whioh loada to dynamic failurt. Tht daadlinaa art 
gtntrally random variable*. Noa-critlcal task* have no auoh deadlines 4 . 

The mieaion lift time i a the duration of operation between aucoeeelve eer* 
vice atagea. 


2.3. The Performance Heaeurea 

We define a coat function C t ({) aaaoclated with ayatem response time ( for 
task 1. For critical tasks, the cost function takes the following form: 

, , k s (0 

^(0 * f - otherwise ( *) 

where gj(f) is a suitable continuous function of ( that increase* monotonically in 
[0. t 4 i], is zero for all {>tdi. and we recognize t*. as the hard deadline associated 
with critical task l* For noncritical tasks, CM) is continuous and therefore 
always bounded for a finite response time. For consistency, we assume that the 
costa accrue as execution proceeds. In other words, if task 1 was triggered r 
units ago and has not yet been completed, its contribution to the cost so far 
accrued is C.(t). 

The cumulative cost function associated with task i. ri(t), is defined as fol- 
lows: 

*(t) 

r t (0 « i C 1 (EXT(S u ,T U .n u ,t)) (2) 

J-l 

where there have been qi(t) triggers for task i in the interval [0,t]. The system 
cost function in a system with r different tasks is defined as: 

4. Salt that dynamic failure encompasses the traditional notion of catastrophic hardware 
failure as well as otner causes (software crashes, system reconfiguration, electromagnetic in- 
terference on the communications network, etc.) of missing one or more hard deadline. 

8. In practice, all that can be assured of the behavior of gj,() is that it la monotonically non- 
decreasing in [0, t e: j. However, we shall need a invert tine cost function, and so, we assume 
a strictiy increasing cost function :n [0, t^J. 
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3<i)-£r,<t). (3) 

M 

Let L(t) » Prob { Mission Lifetime * t ), Q the aet of atatea of the controller, and 
Ac Q the aubaet of atatea in which failure la not certain. Note that failure la cer- 
tain when the expected response time goea to infinity. Define B ■ Q-A. Our per- 
formance meaaurea are then': 


Coat Index. K(x) ■ f f'robjS(t) tf * I dL(t) 

*b 


Probability of dymmic failure, ^*Prob |Slt) « 00 j dL(t) 

Mean Coet, M ■ ^EjS(t) | system never enters state set B|dL(t) 
Variance Cost. V ■ J\ arjS(t) | system never enters state set B{dL(t) 


(4) 

( 5 ) 

(6) 
(?) 


We define also the following auxiliary measures: 


Cost Index for Task i, Kt(y) Probjft(t) « y{dL(t) 


( 9 ) 

( 0 ) 


Kean Cost for Task i, E jr t (t) | system never enters state set B{dL(t) 

Variance Cost for Task i,Vi«^*Vsr|r,(t) | system never enters state set B|dL(t) (10) 

The auxiliary measurer are principally for use during the design phase. 

The mean and variance costs and the co*t Indices can be extremely tedious 
to compute. In their place, it is often useful to substitute the following modified 
parameters: 


Modified Cost Index. R(x) « 


|^Prob|S(t)*xldL(t)IU -p 4y »{ rorx<- 

Pdya for ‘ 

m 

Mean Modified Cost, fl ■ y*E|S(t) | system never enters state set B|dL(t) (12) 


Variance Modified Cost. V ■ AR|3(t) | system never enters state set BJdL(t),(l3) 


6. The Coe; Index :» generally only for use when rival multiprocessors are beins compared. 
. - the evaluation of a ninyla system, the other three measures are sufficient. 
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and the auxiliary meaaurea Kj, fit, and V| are defined analogously, with 

1*1(0 -gg,(REMlj» (14) 

and S(t) * 2r,(t). Also, from physical considerations, there exists a T<«* such 
!■! 

that for all t*T. L(t)»l, so that the above integrals always converge. 

For values of t much greater tb:.~i either the mean task completion time or 
lntertr^ger interval, thr. modified costs are good approximations to their exact 
counterparts. This is of value, since It applies most mlsron lifetimes, (for air- 
craft, mission lifetime can range from 30 minutes to 10 hours, while for space- 
craft, It can range up to several months) and the modified parameters are much 
easier to compute. In the examples presented In this paper, we evaluate modi- 
fied values throughout. 

Let P*(t) be the probability that at time t the system is In state i/eQ, and 
the probability that the system is in state a e A. given that in a mission 
lifetime of {, it never enters a state in B Let P«rnu(t)dt be the probability of 
arrival of one version of task i in [t.t+dt], and v u and W u the density and distri- 
bution functions, respectively, of the controller response time for task 1 when in 
state a Further, denote the distribution of the hard deadline for task 1. if criti- 
cal. by F*!. Then, 

• » I • 

m & L l* L [ 1 ” W ^ M)]P ‘ W(t)P '"« (t) dtdUOdFJw) ♦ l ' tg f 9 P>(0dt ( 1 5) 

^^^^-(r-OEjp.COdt ( 18 ) 

where the approximations hold whenever (as is almost invariably the case in 
practice) each of the integrals is much less than one. 

Lot be a random varable denoting the cor . . button to T*. of a single ver- 
sion of task l. with the system in state a e A, tot(i) the value of T*i over a mission 
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lifetime given that the eyetem remain* in eUte-eet A throughout, end 

r 

ayat ■ £tot(i). Let n«T|i)(u.d) be the probebillty of u arrival* of version! of task 1 in 

an 'ntervel of length t), end Hj the distribution for the sojourn time of the system 
in state v in a mission of length 


Then, define the following characteristic functions: 




. «,W 

.£e"*V*(gf ‘(q))dq dF*(t) if 
f e"^wi.,(gi -l (<i))dq otherwise 


task i is critical 


(17) 


for all a e A 


. - «■ 


*•>«(*) = £ E / /l*C..( a )r n »rr(l)(u.>>) d H. l (d) d L(t) (18) 

•cA u-0 t-0 0 “* 


and 


P 

*•7.1(3) = (19) 

IPI 

The reader will have noticed two raplicit assumptions: (a) that the service 
time for a task is much less than the sojourn time of the system at any one 
state,’ and (b) that costs incurred by each task-version are independent. 

The Cost Indices can now be determined by inverting Vu*<i)(s) and and 

weighting with the appropriate probability of dynamic failure. 


2 . 4 . Application of the Performance Measures 

The mean cost, variance cost and the probability of dynamic failure cam all 
be used in the a ;$ign and evaluation of individual systems. Every such design is 
the result of a multiplicity of decisions regarding scheduling strategy, individual 
component redundancies, speed differentials, etc. The performance measures 


7. State changes occur when hardware or software failures occur in the system. The rate at 
which these occur s far smui.cr than tie mean ten « completion time. A.so, of course, g, is 
invertible in [0, t^J. 
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oan be used u„ optimisation criteria In this context. 

It is suggested that the measures be used In a two-stage process. First, the 
probability of dynamic failure of the designed system is computed and this com* 
pat ad with that required by the specifications. In the event that the system 
meets the dynamic failure requirements, expressions for the mean and variance 
costs are derived, and fine-tuning of the design carried out by studying the sen- 
sitivity of ? e measures to changes in system structure, speed, etc. If the sys- 
tem does not meet the dynamic failure requirements, the mean and variance 
costs are not computed. Instead, the system is redesigned to the point where 
the failure requirements are met before mean and variance costs are con- 
sidered. Figure la summarizes the approach. 

The System and the Task Cost Indices may be regarded as the distribution 
functions of random variables that we shall call the system capability and the 
task capability respectively. The system capability can be used in the com- 
parison of two or more rival computer systems. Figure lb explains the pro- 
cedure. The task capabilities are used more subtly. See Section 3.3. 

3. Remarks on Determining the Performance Measures 

In what follows, the system consists of the controlled process (often 
referred to simply as the "process") and the controller (i.e., the multiproces- 
sor). 

Four items need to be determined, prior to computing the performance 
Indices. These are the distribution of the hard deadlines for each critical task 
i, the finite cost function g t for each task j, the multiprocessor response time 
distribution as a function of its state, and the P,(t) for all v e Q. We concentrate 
here on the first two, ’"eferring the reader to the example presented in Section 4 
and the queueing theory and probability literature for the last two. 
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Typically, critical tasks are associated with life-critical activity and so one 
is especially concerned in finding practical means for accurately estimating 
In o 'der to do so. the hard deadlines must first be derived. In most cases, since 
the environment is only stochastically known, this takes the form of probability 
distributions. 

To explain the derivation of the distribution of t^, a state-variable approach 
is useful. /. general process may be described as 

i(t) = f(*(t). u(t), t) (20) 

where x is the state vector, u is t he input vector, and t represents time. The 

input vector can be partitioned thus: 

u T = [uj | uj] (21) 

where ug is the environment input subvector 8 . It represents the effects of the 

environment on the controlled process. For example, a gust of air is an environ- 
mental input when the controlled process is an aircraft. 

uc is the control input subvector. It represents the input delivered under 
controller command in response to an environmental ; nput. For obvious reasons, 
it is always bounded. 

The process is required to perform within a certain subset of the state- 
space. Let the admissible set of the state vector be bounded and denoted by X * 
Note that the admissible state-space is not necessarily static. 

The control uc is employed to keep the process in X*. The control is clearly 
a function of ug. Since Xa is bounded, there is a bound on the controller 
response time allowed. This bound is the hard deadline. Since the process 
dynamics and the distribution of u E are known, the distribution of each hard 
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8. 9y definition, an input from an I/O panel is also an environmental input. 
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deadline can in theory be determined. 

For clarity, we restate the above. The interval between a trigger and the 
termination of the resulting controller response can be divided into two por- 
tions: controller think-timt (i.e., the response time) and the actuation time din- 
ing which the process reacts to the environmental trigger under controller 
directives The hard deadline associated with a control task i which i3 trig- 
gered by an environmental input is the maximum controller think time permit- 
ted, consistent with environmental conditions, the process dynamics and the 
requirement to keep the system within the admissible state-space, Xt- 

Clearly, in order to derive the hard deadlines, a precise formulation of the 
process dynamics is required. Since such a formulation is a required part in the 
design of a critical process, no additional requirements are imposed on the sys- 
tem designer. 

For an example in determining hard deadlines, the reader is directed to 
Section 4. 

3.2. Derivation of Finite Cost Functions 

Very little work has been published in this area. Cost functions of this type 
are usually -- and if at all -- specified in an ad-hoc manner. The main problem 
here is that the cost functions must be linked to the controlled process to have 
any concrete meaning. Contemporary workers (e g.. [3]) have skirted the prob- 
lem -- with less than convincing results -- by ascribing qualitatively-defined 
"weights" to controller attributes (conventional reliability, throughput, etc.) and 
attempting to match these to corresponding weights for the application (also 
qualitatively obtained). 

Controlled processes have performance measures that are functionals of 
system state and input, and which express the cost of running the process [17], 



[18]. The traditional formulation ia: 
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J(t) * f.(«(t). a(t). t) • (82) 

where J(t) is the instantaneous performance measure . (, the functional and the 

other symbols have their usual meanings. 

The cost of running the process over, say, an interval [to, W]. 


0 = 



We claim that a good representation for giU) is given by* 


where: 


gi(t) 


_ 


n(0 - n(o) for 


otherwise 


0 ( 17 ) = Expected contribution of u«ito 9 if RESP(i)=»j. end 
Uei = control input subvector associated with task i. 

See Section 4 for an example. 


(23) 


(24) 


3.3. Discussion 

By connecting the activity of the control system with that of the controlled 
process, a proper foundation has been provided for the definition of controller 
performance. All quantitites relating to the performance measures can be calcu- 
lated from quantities that may be estimated or measured. It is true that our 
methods presuppose a knowledge of the controlled process and its dynamics 
that may not at first be accurately available. However -- and this is the crucial 
point - an understanding of the dynamics of the process will surely increase 
with experience (say with a mathematical model or a prototype). There is 


9. It is easy to see that many other suitable representations of the cost function can exist. 
For instance, one can modify 0 (tj) to include higher moments of the contribution to 9. How- 
ever, for most practical purposes, the above measure should suffice, it is usually only when 
the intertrigger or service time distributions exhibit strongly non-Markovian behavior that 
such a modification needs to be considered. Also, £1 is 0 for values of response time greater 
than the associated hard deadline since to speak of finite cost after failure has occurred is 
meaningless. 
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tharefore * learning curve associated with process operation and hence by 
Induction another for the performance measures: Note that we are here consid- 
ering critical processes such as nuclear reactors or aircraft and that such 
processes call for extensive testing and analysis before release or installation. 
Also, the operating environment for these critical processes is usually well 
known, so that a good model generally exists for the intertrigger distributions. 

One can express uncertainty about the accuracy of the individual cost func- 
tions in terms of a confidence measure. The sensitivity of the system capability 
to changes in the cost functions is given indirectly by the individual task capa- 
bilities . The latter oan therefore be used with confidence intervals for the cost 
functions to obtain a confidence interval for the system capability. 

4. Example 

In this section, we present an example for the determination of hard dead- 
line distributions, mean and variance costs, and the analysis of a multiprocessor 
system. Both the example process and the multiprocessor are somewhat ideal- 
ized, the principal purpose being to illustrate what has gone before. 

4.1. The Process 

4.1.1. Description 

The controlled process is a communications satellite of mass m, floating in 
free space. One critical controller task (task 1) is to keep it within a sphere of 
radius R centred on a "rest position", *o- The environment is characterized by a 
series of impulses, deriving from meUorite impacts, and arriving along random 
directions. The meteorites arrive according to a Poisson distribution with 
parameter X. The satellite must be restored to rest at *o before the next meteor- 
ite comes in. The energy the meteorites impart to the satellite is assumed to be 
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constant at k units per impact. 

The controller employs rockets to respond to the impulses and can impose 
a maximum thrust of a units of force in any direction. The rockets must be 
adjusted to the right direction, and this deployment takes t, units of time. 

Tbe performance index with respect to this one task is 0 = total energy 
expended by the rockets. The problem is to determine the distribution of Ui and 
the finite cost function, g t (t). 


4.1.2. Derivation of t*i Distribution 

Catastrophic failure will occur either if the satellite is not restored to rest 
at so by the time the next impulse comes along, or if the satellite leaves the 
admissible state space. 

t dI is the maximum think-time, and finding it is equivalent to solving for the 
minimum actuation time. We clearly have a bang-bang control solution. Using 
standard methods from optimal control theory, we conclude that 

t*i = max J min[A(T), t,], 0 | ( 25 ) 

where 

t is the interval between successive meteorite arrivals, 


aw . T-T.+ _2 

a v a* a 


= x/ — 
v 2k 


R -- 
a 


The distribution function for t d i is given by: 


F d ,«) = 


0 if <■ < 0 

t -e - ** -1 * 0 if 0 £ f st j 

1 otherwise 


4.1.3. Derivation of g k 


( 80 ) 

( 87 ) 


( 88 ) 


Notice that since the environment is stochastic, the controller will always 
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order maximum thrust. The duration for which the thrust Is maintained will 
depend on the response time, All computations are made under the condition 
that {at 4 i. Using the laws of mechanics, we arrive at the following result, which 
the reader may easily verify: 

Expected contribution of u,i to 0 If response time = {, 

0»(O = n,( 0 )+ (29) 

We can therefore now write: 

ti(f) * for 0***tdi ( 3 °) 

The cost function Cj({) is now completely determined. It only remains for us to 

obtain the controller response time distribution and state probability functions. 

Burnside NoUm the tradeoff b et we e n t^i end g t . Ae the rocket thrust, a, is 
incraesod, the oipcrted nhu d t*i in creeses, insuring that men time is evsfl- 
tU* for the controller, and the chances of catastrophic failure are reduced. 

This is paid for in the form of increased operating cost. Le., gi rises faster. 


4.2. The Multiprocessor 


4.2.1. Description 

The multiprocessor is configured a i shown in Figure 2. It consists of a 
despatcher with an infinite buffer which dynamically assigns tasks to processors 
as they come in and the processors become free. Input rate is Poisson, with 
parameter X| for task i. There are r tasks in the system, and all are critical. All 
tasks have an identical service time distribution 10 . There is no common memory 
in the system, each processor is assumed to have all the system applications 

software stored in its private memory 11 . The despatcher schedule is FIFO. 

10. This assumption (used by many authors eg., [14] )is removed by us in [20]. 

11. This assumption is not as unrealistic as it might seem. With memory densities rising, and 
costs falling, there are emerging designs based on this Idea (eg., CM*FCS system [20]). The 
removal of this assumption entails analysis of a blocking problem. For a study of this type of 
system, see [22]. 
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Prooeisori fail at an axponantial rata of tho dispatcher (which la aaaumtd to 
have internal redundancy) with rate ho< end each of the q redundant buaes (of 
which only one la active at any one time) with rate tsm- Components fail indepen- 
dently of each other, and coverage is 100%. Queueing delays at buses are small 
enough to be ignored 11 . 

4.2.2. Determining Task Execution Rate 

The time taken to execute a task on a processor is a random variable whose 
distribution is affected by operating conditions and system parameters. The dis- 
tribution is determined by operating system characteristics, processor failure 
rates (both transient and permanent) and the probability of incorrect transmis- 
sion over the intercommunication medium together with the conditional 
branches within the executed code. 

In any reliable system, the effects of the pertubations due to hardware 
failures or electromagnetic interference are very small due to their low proba- 
bility of occurrence. It is generally assumed in analyses of this type that the 
software execution time assuming no pertubation due to hardware failure or 
interference is exponentially distributed. Since the pertubations are small, we 
approximate them by assuming the service rate still to be exponential, but shift- 
ing its parameter slightly to allow for the pertubatlonB. Let this parameter be 
To determine fi. we use the probability model shown in Figure 3. tt 0 is the den- 
sity function for the execution time of the software assuming no pertubations 
due to failure, incorrect transmissions, etc. This quantity can be derived 
through experiment, and naturally depends on the nature of the code. n t is the 
density function of the j-th pertubation and a, is the probability that this pertu- 
bation occurs during one execution of the code. Both these quantities can be 


12. Thia is characteristic of this type of system. Meaaagea tranamiUed are invariably either 
control messages or very small packets of data. 
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determined by experiment and/or analyals. Ae a liming, (or analytical simplicity , 
that the pertubations are mutually independent, the characterlatlc function of 
the service time pdf for the code, can be written down by inspection as: 

• nUw*i-4 (30 

where 


tj(s) is the Laplace transform of nj(t). 

The mean and variance of th ? execution time can be determined from 4,(s). 
Let them be M and V, respectively. Then, the value of tx is taken to be 


min 1/M, 1/V . 


This is generally a rather conservative estimate. 


4.2.3. Analysis of the Multiprocessor 

(a) Response lime Density Function: System failure occurs when the 

I*, 

despatcher fails, all buses fail, or when there are n = l^—J processors or less 

functioning. The state description is: 7-0 when there is system failure, and 
7 =n+l c, when there is no system failure, and there are n+1 c. proces- 
sors respectively functioning in the system. Clearly, A=jn+1 c{. and B=|0J. 

r 

Let A = The system at state x (x>0) is essentially an M/M/x queue when 

t-i 

the (small) despatcher service time is Ignored. Therefore, the pdf of the 

response time of the system in state x (x>0) for each task i = 1 r is given by 

[23]: 


where: 


w u (t) 


X - (x-Dn 


W,(0) = 1 


x(X/M)* 1 f a I 1 

x!(x— A/ /*) j! [m 


+ 


1 xm 

x! [xm*“X 


(32) 


(33) 
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(b) SUU Probability Functions: 

Let prft). po(t), and p*(t) be the probability of failure of a processor, 
despatcher and bus by time t respectively, where: 


prfO-l-e^ 

Pb( 0 ■ i - 
PDCO-l-e"^ 1 

Then, the probability of the system being in state 0 at time t, 

P.W « gfjlMOI-li-PFd)]’ ♦ + po(0 

and, the corresponding probability for state a, for a=n+ 1 c, 


(34) 

(35) 
(3«) 


(37) 


P.(t)«[l-|[ps<t)lVp I) (t)j][ 


[Pp(t)] e ’ , ll-pp(t)]']. and 


pW/»\ a ^*(t) 


(38) 

(39) 


(?) Sojourn Time Density Function, Hj(t): The derivation is conceptually 
straightforward. We therefore present only the result. 

For a=n+l c-l, 


H. T (t) 


where 


F,,(0 


l 


0 for t<0 

l-F % (T-t)|l-F M ,(t)m-F T .(t)J for 0*t<T 

1 for t»T 


for t<0 


(40) 


forOetcT 
|J-‘ 

for t>T 

(-UTicl 


A/*’ 

V*’ * 


Q-l)!(c-a-j)!a!(a+j) 

for t<0 
for t^O 


F '«-• (t) = l-/F.(k-t)dF^k) 

Sat 

F*(t) * ♦ (l-e'^ l )(l-|l-e' # * Bl |') + (l-e'^» l )(l-e" M3 > 


(41) 

(42) 

(43) 

(44) 


(45) 


Alio, 
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where 


. fort<0 

Hj(t) - jr^dHr^D-r^djr^t) for ixkt 
l for t>T 


(«) 


rut) ■ 1 - •‘“ r ‘ (4?) 

The expressions follow from a realization that when the mission lifetime Is 
given, the sojourn times are no longer Independent random variables. 


4.3. Numerical Results 

The parameters used for the process are r=7B, X,= .5 for all tasks, /x=20 
k=l, m=2, a=l, and R=1.75. The controller parameters are mp* 10"*. mb* 10"*. 
and mo = 10“* All tasks are assumed to be critical with the cost function of the 
task in the example in Section 4.1. 

For a system of this type, the only design variables are the number of pro- 
cessors the number of buses, and the speed of the hardware. The graphs that 
make up Figure 4 show the marginal benefit to be gained from the addition of 
processors. Addition of buses, it was found, has practically no impact on the pro- 
bability of dynamic failure or on the mean and variance costs. (Mean and vari- 
ance costs are equal in this system). 

The marginal benefit to be gained from raising the processor strength from 
2 to 3 is considerable; above a processor count of 4. the marginal benefits are 
practically non-existent for probability of dynamic failure. Similarly, above a 
processor count of 5, there is virtually no benefit to be accrued in adding pro- 
cessors. (Keep in mind that these re narks hold good only for mission lifetimes 
in the range considered.) The failure rate at the asymptotic level of infinite pro- 
cessors is principally due to the probability of working processors exceeding the 
hard deadline, and the mean and variance costs become the cost associated 
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with mean processor service time. If bat tar parformanca than that provldad by 
tha asymptotic level la raquired, It can only ba obtalnad by tha Introduction of 
faatar hardware. 


5. Discussion 

In thla paper, wa have presented performance indices that are objective in 
the way they measure performance. Due to this objectivity, they oan be used to 
advantage in both tha design and evaluation phases of the development of a con- 
troller system. 

As for design, the measures should help identify quasi-optimal architec- 
tures and operating systems. In the former category are included decisions 
regarding the interconnection structure and component speed differentials. In 
the latter category, we include the choice of schedule in the access of shared 
resources, the design of failure recovery procedures -especially the optimal 
placement of recovery points- and algorithms controlling the allocation and 
reallocation of tasks to the individual processors. The lack of objective indices so 
far has forced contemporary workers to employ overly simplistic performance 
indices, most notably in solving the task allocation problem (such as ascribing 
unit cost to each transaction on the interconnection network [24]. ) 

In the sphere of evaluation, the measures can be used to provide either an 
absolute or relative index to controller performance. In the former case, the 
performance functional, f„ has to be defined with particular care, and the 
indices provide a measure of the overhead cost incurred in control. A second 
important application is the comparison of rival controller systems for particu- 
lar applications. 

Acknowledgements 

The authors wish to thank Ricky Butler and Vllton Holt of the NASA Langley 


■ 82 - 


OWONAl PAGE » 

Of pnoo Aim mi 



M«nao%i M 


1. A. L. Hopkins, at. a l, "FTMP - A Highly Reliable Fault-Tolerant Multiprocessor 
for Aircraft”. Proc. IEEE, Vol. 66. No. 10, pp. 1221-1239, October 1976. 

2. J. H. Wenseley, at. a l, "SIFT - Software Implemented Fault Tolerance,” Pro c. 
IEEE, Vol. 66, No. 10, pp. 1240-1256, October 1976. 

3. M. J. Gonzalez and B. V. Jordan, ”* framework for the Quantitative Evaluation 
of Distributed Computet' Systems”, IEEE Trona. Cbmput.. Vol C-29, No. 12, Dec. 
1960, pp. 1067-1094. 

4. J. von Neumann, "Probabilistic logics and the Synthei is of Reliable Organisms 
from Unreliable Components”, in Automata Studiaa, pp. 43-98, Princeton Univer- 
sity Press, Princeton. NJ 1956. 

5. E. E. Moore and C. E. Shannon. "Reliable Circuits Ueing Less Reliable Relays”, 
/. Franklin Inst., Pt. I. Vol. 282, pp. 191-208, and Pt. II. pp. 281*297, 1958. 

6. W. G. Bourlcius, at. a l, "Reliability Modelling Techniques for Self-Repairing 
Computer Systems," Proc. ACM 1969 Annual Con/., pp. 295-309. August 1989. 

7. R. R. Borgerson and R. F. Frletas. "A Reliability Mcael for Gracefully Degrading 
and Standby-Sparing Systems." IEEE Trona. Comput., Vol. C-24, pp. 517-525, 
May 1975. 

8. G. N. Cherkesov, "Seml-Markovtan Models of Reliability of Multichannel Sys- 
tems with Unreplenlshable Reserve of Time," Enginaartng Cybamatica. Vol. 18, 
pp. 65-78, Mar/April 1980. 

9. J. Losq, "Effects of Failures on Gracefully Degradable Systems," Seventh Annu. 
Jnt 'l Conf. on Fault’Tolarxmt Computing, Los Angeles, CA. pp. 29-34, March 1977. 

10. R. Troy, "Dynamic Reconfiguration: An Algorithm and its Efficiency Evalua- 
tion." op. ci t, pp. 44-49. 

11. J. F. Meyer, at. oL. "Perform ability Evaluatlrn of the SIFT Computer." IEEE 
Trona. Comput., Vol. C-29, No. 6. pp. 501-509. June i960. 

12. M. D. Beaudry. "Performance-Related Reliability Measures for Computing 
Systems.” IEEE Truna. Comput., Vol. C-27, No. 6. pp. 540-547, June 1978 

13. J. F. Meyer, "On Evaluating the Perforraability of Degrading Computer Sys- 
tems,” IEEE Trona. Comput., V*l. C-20, pp. 720-731. August 1980. 




I 


% 


13. To maintain anonymity during the review proceee, reference* 20 and 22 are not inc. jded 
in this last. 


14. J. F. Meyer, "Closed-Form Solutions of Performeblllty." IEEE Trans. Cbmpuf., 
Vol. C-31, No. 7. pp. 648-657, July 1066. 

16. G. K. Manacher, "Production and Stabilisation cf Real-Time Task Schedules," 
/. ACM, Vol. 14. No. 3. July 1967, pp. 430-465. 

16. J. A. White, et. ai, "Multim.croprocessor Flight Control: System Architectural 
Concepts." Pro c. AIAA Comp. in Aerosp. Con/., pp. 67-02, October 1970. 

17. D. E. Kirk, Optimal Control Theory, Prentice Hall, Englewood Cliffs, NJ, 1970. 

IB. A. P. Sage, Optimum Systems Control, Prentice Hall, Englewood Cliffs, NJ, 
1077. 

19. M. Reiser and K. M. Chandy, eds., Cumputer Performance, North-Holland 
Publishing Co., Amsterdam, 1977. 

20. 22: (Our forthcoming papers: see Footnote 13.) 

21. S. J. Larimer and S. K. Maher. " The Continuously Reconfiguring Multiproces- 
sor," NATO-AGARD Meeting on Tactical Airborne Computing, Roros. Norway, 

1981. 

23. D. Gross and C. M. Harris. Fundamentals of Queueing Theory, John Wiley, 
New York. 1974. 

24. W. W. Chu, et. al, "Task AMocation in Distributed Data Processing," Compulsr. 
Vol. 13. No. 11. pp. 57-70. Nov ,980. 


ORIGINAL PAGZ 13 
OF POOR QUALITY 


| START 

I 



m t 

i 

y 

» — \ 

r 



5 *yn rwqirtrtmantsj 
\ met? / 



/ possible to > 
improve Pdyn 
swith a few systei 
\ changes ?/ 


Evaluate expression 

for S3 and P 


Carry out 
improve ments 


.Reject structure 
as unsuitable 


Update design based 
on sensitimty of p dyn . 
St, and V to design 
changes 


STOR 


' the new design s 
significantly 
better than the 
'preceding one/^ 


STOP 


Fig la: System Refinement 










PROCEDURE: COMPARISON (A) 

/• A={Ai A m J are the systems to be compared. •/ 

for i=l to m do 
begin 

CALL SELECTION (A) 

RANK(i) - SELECT 
A = A - (SELECT} 

end 

end COMPARISON: 
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PROCEDURE: SELECTION (A) 

/•SCj is the system capability of system Aj. 

A = (Ai A n { are the systems to be compared. V 

CHOSEN - 0 
for i= 1 to n do 
begin 

min (SCJ 
define MIN (i) = *-,44-1 B 1 J> 

define PROB(i) = Prob ( SCj ^ MIN(i) } 

if PROB(i) > CHOSEN then 

begin 

CHOSEN «- PROB(i) 

SELECT «- i 

end 

end 

end SELECTION; 


Fig lb: Comparing Rival Systems 








««1 



